# Multimodal Visual Question Answering
## Qwen2.5 VL 72B Instruct FP8 Dynamic
parasail-ai · Apache-2.0 · Image-to-Text · Transformers · English

FP8-quantized version of Qwen2.5-VL-72B-Instruct, supporting vision-text input and text output; optimized and released by Neural Magic.
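The "FP8 dynamic" scheme means quantization scales are computed per tensor at runtime rather than calibrated offline. A toy NumPy sketch of that range mapping (illustrative only, not Neural Magic's actual kernels; it models the E4M3 dynamic range but not mantissa rounding):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def fp8_dynamic_scale(x: np.ndarray) -> float:
    # "Dynamic" quantization: the scale comes from the live tensor,
    # so no offline calibration pass is needed.
    return float(np.abs(x).max()) / E4M3_MAX

def fake_fp8_roundtrip(x: np.ndarray):
    scale = fp8_dynamic_scale(x)
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)  # now fits the FP8 value range
    return q * scale, scale  # dequantized view and the per-tensor scale

acts = np.array([0.5, -3.2, 100.0, -7.0], dtype=np.float32)
deq, scale = fake_fp8_roundtrip(acts)
```

Because the scale maps the tensor's own max magnitude onto the FP8 range, nothing is clipped here and the round trip is lossless up to floating-point rounding.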
## Qwen2.5 VL 3B Instruct Quantized.w4a16
RedHatAI · Apache-2.0 · Image-to-Text · Transformers · English

Quantized version of Qwen2.5-VL-3B-Instruct with weights quantized to INT4 and activations kept in 16-bit precision (w4a16), designed for efficient vision-text inference.
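As a sketch of what the w4a16 scheme does (weights to 4-bit integers, activations left in 16-bit), here is a minimal symmetric per-channel INT4 round trip in NumPy; this is illustrative only, not the actual compressed-weights implementation:

```python
import numpy as np

def quantize_w4(W: np.ndarray):
    # Symmetric per-output-channel INT4: codes in [-8, 7], one FP scale per row.
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
    codes = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize_w4(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Dequantized weights (and activations) stay in 16-bit: the "a16" part.
    return (codes * scale).astype(np.float16)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16)).astype(np.float32)
codes, scale = quantize_w4(W)
W_hat = dequantize_w4(codes, scale)
max_err = float(np.abs(W - W_hat.astype(np.float32)).max())
```

The worst-case error per channel is about half a quantization step, i.e. roughly `scale / 2`, plus a little FP16 rounding.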
## Qwen2.5 VL 72B Instruct FP8 Dynamic
RedHatAI · Apache-2.0 · Image-to-Text · Transformers · English

FP8-quantized version of Qwen2.5-VL-72B-Instruct, supporting vision-text input and text output, suitable for multimodal tasks.
## Qwen2 VL 7B Instruct GGUF
XelotX · Apache-2.0 · Image-to-Text · English

Quantized GGUF builds of the multimodal model Qwen2-VL-7B-Instruct, supporting image-text-to-text tasks at various quantization levels.
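GGUF files carry a small self-describing header: per the GGUF specification, the file starts with the 4-byte magic `GGUF` followed by a little-endian `uint32` format version. A quick sanity-check sketch (the `demo.gguf` file here is fabricated just to exercise the check, not a real model):

```python
import struct

def gguf_version(path: str):
    # Returns the GGUF format version, or None if the magic doesn't match.
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            return None
        (version,) = struct.unpack("<I", f.read(4))
        return version

# Write a fake 8-byte header so the check has something to read.
with open("demo.gguf", "wb") as f:
    f.write(b"GGUF" + struct.pack("<I", 3))

print(gguf_version("demo.gguf"))  # → 3
```

This is handy for verifying that a multi-gigabyte download is not truncated or mislabeled before handing it to llama.cpp.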
## Erax VL 7B V2.0 Preview GGUF
mradermacher · Apache-2.0 · Image-to-Text · Multilingual

EraX-VL-7B-V2.0-Preview is a multimodal foundation model supporting Vietnamese, English, and Chinese, suitable for a range of vision-language tasks.
## Erax VL 2B V1.5 Q4 K M GGUF
Ngoac · Apache-2.0 · Image-to-Text · Multilingual

A Q4_K_M GGUF conversion of erax-ai/EraX-VL-2B-V1.5, a multimodal visual question answering model supporting Vietnamese, English, and Chinese.
## QVQ 72B Preview GGUF
XelotX · Other · Image-to-Text · English

An imatrix-quantized GGUF build (llama.cpp) of QVQ-72B-Preview, a multimodal large language model that understands image and text inputs and generates text.
## Qwen2 VL 7B Instruct GGUF
second-state · Apache-2.0 · Image-to-Text · English

Qwen2-VL-7B-Instruct is a multimodal vision-language model that jointly understands image and text inputs and generates text.
## Paligemma2 28b Pt 896
google · Image-to-Text · Transformers

PaliGemma 2 is a vision-language model (VLM) from Google that combines the Gemma 2 language model with the SigLIP vision model, taking image and text inputs and generating text outputs.
## Paligemma2 3b Mix 224
google · Image-to-Text · Transformers

PaliGemma 2 is Google's upgraded vision-language model built on Gemma 2, taking image and text inputs and generating text outputs for a range of vision-language tasks.
## Minicpm Llama3 V 2 5 GGUF
gaianet · Image-to-Text · Multilingual

MiniCPM-Llama3-V-2_5 is a multimodal visual question answering model based on the Llama 3 architecture, supporting interaction in both Chinese and English.
Llama 3.1 8B Vision 378
This project trained a projection module to add visual capabilities to Llama 3 using SigLIP technology, applied to the Llama-3.1-8B-Instruct model.
Image-to-Text
Transformers

L
qresearch
203
35
## Yi VL 6B Hf
BUAADreamer · Other · Image-to-Text · Transformers · Multilingual

Yi-VL-6B is a multimodal vision-language model developed by 01-AI, supporting both Chinese and English and suited to tasks such as visual question answering.
## Paligemma 3b Ft Science Qa 448
google · Image-to-Text · Transformers

PaliGemma is a 3B-parameter lightweight vision-language model from Google, built on the SigLIP vision model and the Gemma language model, taking image and text inputs and generating text outputs.
## Paligemma 3b Mix 448
google · Image-to-Text · Transformers

PaliGemma is a versatile lightweight vision-language model (VLM) built on the SigLIP vision model and the Gemma language model, taking image and text inputs and generating text outputs.
## Paligemma 3b Ft Docvqa 896
google · Image-to-Text · Transformers

PaliGemma is a lightweight vision-language model from Google, built on the SigLIP vision model and the Gemma language model, supporting multilingual image-text understanding and generation.
## Paligemma 3b Ft Vqav2 448
google · Image-to-Text · Transformers

PaliGemma is a lightweight vision-language model from Google that combines image understanding with text generation and supports multilingual tasks.
## Paligemma 3b Ft Ocrvqa 448
google · Image-to-Text · Transformers

PaliGemma is a versatile lightweight vision-language model (VLM) from Google, built on the SigLIP vision model and the Gemma language model, taking image and text inputs and producing text outputs.
## Firellava 13b
fireworks-ai · Image-to-Text · Transformers

FireLLaVA-13B is a vision-language model trained on instruction data generated by open-source large language models, supporting image understanding and text generation.